This dataset is taken from the research paper cited below.
The goal of the research is to help auditors by building a classification model that can predict fraudulent firms on the basis of present and historical risk factors. The sectors and the counts of firms in each are: Irrigation (114), Public Health (77), Buildings and Roads (82), Forest (70), Corporate (47), Animal Husbandry (95), Communication (1), Electrical (4), Land (5), Science and Technology (3), Tourism (1), Fisheries (41), Industries (37), Agriculture (200).
The data comes in two CSV files. Merge these two datasets into one dataframe; all steps should be done in Python, and the CSV files themselves must not be modified. Use Audit_Risk as the target column for regression tasks and Risk as the target column for classification tasks.
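The merge can be sketched with toy frames (the column names below are a small illustrative subset of the real files, which would be read with `pd.read_csv`):

```python
import pandas as pd

# Toy stand-ins for audit_risk.csv and trial.csv (columns are an illustrative subset).
audit = pd.DataFrame({"LOCATION_ID": [1, 2, 3], "PARA_A": [0.5, 1.2, 0.9],
                      "Audit_Risk": [2.4, 0.6, 1.8], "Risk": [1, 0, 1]})
trial = pd.DataFrame({"LOCATION_ID": [1, 2, 3], "PARA_A": [0.5, 1.2, 0.9],
                      "Marks": [2, 4, 2], "Risk": [1, 0, 0]})

# Inner-merge on the columns the two files share (excluding the targets);
# overlapping non-key columns such as Risk get _x/_y suffixes.
common = ["LOCATION_ID", "PARA_A"]
merged = trial.merge(audit, on=common, how="inner").drop_duplicates()
print(merged.columns.tolist())
```

The suffixed Risk_x/Risk_y pair produced here is exactly what a logical OR later collapses into a single target.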
Many risk factors are examined from various areas, such as past records of the audit office, audit paras, environmental condition reports, firm reputation summaries, ongoing issue reports, profit-value records, loss-value records, and follow-up reports. After in-depth interviews with the auditors, the important risk factors were identified and their probability of existence was calculated from present and past records.
Hooda, Nishtha, Seema Bawa, and Prashant Singh Rana. 'Fraudulent Firm Classification: A Case Study of an External Audit.' Applied Artificial Intelligence 32.1 (2018): 48-64.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
data_auditRisk = pd.read_csv("audit_risk.csv")
data_trial = pd.read_csv("trial.csv")
data_auditRisk.T
data_auditRisk.info()
data_trial.info()
data_auditRisk.head()
data_trial.head()
# Both files contain a PROB column; rename the audit one so the two stay distinct after the merge
data_auditRisk.rename(columns={'PROB': 'PROB1'}, inplace=True)
print(data_auditRisk.columns)
print(data_trial.columns)
Dropping the Detection_Risk column, as it has zero variance.
data_auditRisk = data_auditRisk.drop("Detection_Risk", axis = 1)
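Rather than hard-coding the column, zero-variance columns can be found programmatically; a minimal sketch on a toy frame (the constant Detection_Risk column here is just a stand-in):

```python
import pandas as pd

df = pd.DataFrame({"Detection_Risk": [0.5, 0.5, 0.5],  # constant column, zero variance
                   "PARA_A": [0.1, 2.3, 1.1]})

# A column with a single unique value has zero variance and carries no signal.
zero_var = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=zero_var)
print(zero_var)
```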
data_trial.T
data_trial['Risk'].unique()
data_auditRisk.head()
data_auditRisk.describe()
data_auditRisk["Money_Value"].head()
Common columns:
c_cols = ['Sector_score', 'LOCATION_ID', 'PARA_A', 'Score_A', 'PARA_B', 'Score_B', 'TOTAL', 'numbers', 'Money_Value', 'History', 'Score', 'Risk']
Columns in trial_df but not in audit_risk_df:
only_in_trial_cols = ['Marks', 'MONEY_Marks', 'District', 'Loss', 'LOSS_SCORE', 'History_score']
# Score_A and Score_B in audit_risk are 10x smaller than their counterparts in trial,
# so rescale them to make the merge keys comparable
data_auditRisk["Score_A"] = data_auditRisk["Score_A"]*10
data_auditRisk["Score_B"] = data_auditRisk["Score_B"]*10
dfwith_risk = ['Sector_score', 'LOCATION_ID', 'PARA_A', 'Score_A', 'PARA_B', 'Score_B', 'TOTAL', 'numbers', 'Money_Value', 'History','Score', 'Risk']
dfwithout_risk = ['Sector_score', 'LOCATION_ID', 'PARA_A', 'Score_A', 'PARA_B', 'Score_B', 'TOTAL', 'numbers', 'Money_Value', 'History','Score']
dfwith_risk_upper = [x.upper() for x in dfwith_risk]
dfwithout_risk_upper = [x.upper() for x in dfwithout_risk]
audit_names = data_auditRisk.columns
audit_names_upper = [x.upper() for x in audit_names]
data_auditRisk.columns = audit_names_upper
trial_names = data_trial.columns
trial_names_upper = [x.upper() for x in trial_names]
data_trial.columns = trial_names_upper
# c_with_risk_cols will result in an inner merge (~580 observations on dropping duplicates)
# c_without_risk_cols will result in 763 observations after dropping duplicates, but with two
# target columns (RISK_x, RISK_y), which can be reduced to one with a logical OR
Mainrisk_df = data_auditRisk.merge(data_trial, on=dfwithout_risk_upper)
Mainrisk_df.shape
Mainrisk_df = Mainrisk_df.drop_duplicates()
Mainrisk_df.shape
Mainrisk_df.columns
##Central Imputation
Mainrisk_df['MONEY_VALUE'] = Mainrisk_df["MONEY_VALUE"].fillna(Mainrisk_df["MONEY_VALUE"].mean())
Mainrisk_df.isnull().sum()
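The central (mean) imputation above can be illustrated on a toy series:

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 30.0, np.nan, 20.0], name="MONEY_VALUE")

# fillna with the column mean: missing entries become mean([10, 30, 20]) = 20.
filled = s.fillna(s.mean())
print(filled.tolist())
```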
LOCATION_ID, which holds three string values with the rest numeric, is a categorical attribute.
Mainrisk_df = Mainrisk_df.copy()
Mainrisk_df[['LOCATION_ID']] = Mainrisk_df[['LOCATION_ID']].astype('category')
# Check type conversions
Mainrisk_df.dtypes
Since LOCATION_ID contains three string values, each is replaced with a unique number.
Mainrisk_df["LOCATION_ID"]= Mainrisk_df["LOCATION_ID"].replace("LOHARU", 45)
Mainrisk_df["LOCATION_ID"]= Mainrisk_df["LOCATION_ID"].replace("NUH", 46)
Mainrisk_df["LOCATION_ID"]= Mainrisk_df["LOCATION_ID"].replace("SAFIDON", 47)
Mainrisk_df["LOCATION_ID"].unique()
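The three replace calls can equally be written as one dict-based replace; a small sketch with illustrative values:

```python
import pandas as pd

loc = pd.Series([5, "LOHARU", 12, "NUH", "SAFIDON"], name="LOCATION_ID")

# Map the three string locations to unused numeric codes in a single pass.
codes = {"LOHARU": 45, "NUH": 46, "SAFIDON": 47}
loc = loc.replace(codes).astype(int)
print(loc.tolist())
```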
Mainrisk_df.describe()
From the description above it is observed that some columns, such as PARA_B, TOTAL and RISK_B, have maximum values far beyond their 3rd quartile; hence outliers are present in these columns.
We do not touch the target column (AUDIT_RISK) on which we perform regression.
plt.boxplot(Mainrisk_df['AUDIT_RISK'])
plt.boxplot(Mainrisk_df['PARA_B'])
The boxplot shows that only one observation is an outlier for the column PARA_B.
Mainrisk_df[Mainrisk_df['PARA_B']==1264.630000]
Mainrisk_df.shape
Mainrisk_df_rmout = Mainrisk_df[Mainrisk_df.PARA_B != 1264.630000]
plt.boxplot(Mainrisk_df_rmout['PARA_B'])
Mainrisk_df_rmout[['MONEY_VALUE','RISK_D']].describe()
Mainrisk_df_rmout[(Mainrisk_df_rmout['INHERENT_RISK'] == 622.838000) | (Mainrisk_df_rmout['TOTAL'] == 191.360000) | (Mainrisk_df_rmout['MONEY_VALUE'] == 935.030000) |(Mainrisk_df_rmout['RISK_D'] == 561.018000)]
Mainfinal_df = Mainrisk_df_rmout[(Mainrisk_df_rmout['INHERENT_RISK'] != 622.838000) & (Mainrisk_df_rmout['TOTAL'] != 191.360000) & (Mainrisk_df_rmout['MONEY_VALUE'] != 935.030000) & (Mainrisk_df_rmout['RISK_D'] != 561.018000)].copy()  # copy so later column assignments avoid SettingWithCopyWarning
Mainfinal_df
plt.boxplot(Mainfinal_df['INHERENT_RISK'])
Mainfinal_df.shape
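The cells above remove specific extreme values by hand; an equivalent generic rule is Tukey's 1.5 × IQR fence, sketched here on toy numbers (1264.63 stands in for the PARA_B outlier seen earlier):

```python
import numpy as np

x = np.array([1.2, 0.8, 1.5, 0.9, 1264.63, 1.1])

# Tukey's rule: keep points within 1.5 * IQR of the quartiles.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
mask = (x >= q1 - 1.5 * iqr) & (x <= q3 + 1.5 * iqr)
print(x[mask])
```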
On merging the two dataframes, two risk columns are formed (RISK_x and RISK_y), since the risk values differ between the dataframes. A single RISK column is built by taking their logical OR.
Mainfinal_df['RISK'] = Mainfinal_df['RISK_x'] | Mainfinal_df['RISK_y']
Mainfinal_df = Mainfinal_df.drop(['RISK_x','RISK_y'],axis=1)
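The OR step in miniature, with hypothetical rows:

```python
import pandas as pd

df = pd.DataFrame({"RISK_x": [0, 1, 0, 1], "RISK_y": [0, 0, 1, 1]})

# A firm is flagged risky if either source flags it: elementwise logical OR.
df["RISK"] = df["RISK_x"] | df["RISK_y"]
df = df.drop(columns=["RISK_x", "RISK_y"])
print(df["RISK"].tolist())
```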
Mainfinal_df
Mainfinal_df.describe()
Mainfinal_df.info()
import pandas as pd
import numpy as np
Mainfinal_df.columns
It is observed that DISTRICT_LOSS and DISTRICT have the same values and the same effect on the target, so one of the two attributes (DISTRICT) is dropped. It is also observed that MONEY_MARKS and SCORE_MV differ by a constant factor of 10 (i.e., SCORE_MV*10 equals MONEY_MARKS), so the MONEY_MARKS attribute is dropped as well.
Mainfinal_df = Mainfinal_df.drop(['MONEY_MARKS','DISTRICT'],axis=1)
Mainfinal_df.columns
Plotting Sector_score vs Risk.
It is observed that the risk is 1 for sector scores between 2.72 and 3.89.
sns.countplot(x='SECTOR_SCORE',data=Mainfinal_df[['SECTOR_SCORE','RISK']],
hue="RISK").set_title("Sector_score Vs Risk")
plt.xticks(rotation=45)
It can be observed that the risk is 1 for locations with IDs 8, 23, 2 and 16.
fig = plt.figure(figsize=(20,20))
sns.countplot(x='LOCATION_ID',data=Mainfinal_df[['LOCATION_ID','RISK']],
hue="RISK").set_title("LOCATION_ID Vs RISK")
plt.xticks(rotation=45)
It is observed that for zero history the risk is low, i.e., mostly risk = 0.
##fig = plt.figure(figsize=(20,20))
sns.countplot(x='HISTORY',data=Mainfinal_df[['HISTORY','RISK']],
hue="RISK").set_title("HISTORY Vs RISK")
plt.xticks(rotation=45)
It is observed that DISTRICT_LOSS = 2 carries less risk, i.e., mostly risk = 0.
##fig = plt.figure(figsize=(20,20))
sns.countplot(x='DISTRICT_LOSS',data=Mainfinal_df[['DISTRICT_LOSS','RISK']],
hue="RISK").set_title("DISTRICT_LOSS Vs RISK")
plt.xticks(rotation=45)
Here NUMBERS refers to the number of transactions; the risk is 0 when the number of transactions is 5.
##fig = plt.figure(figsize=(20,20))
sns.countplot(x='NUMBERS',data=Mainfinal_df[['NUMBERS','RISK']],
hue="RISK").set_title("NUMBERS Vs RISK")
plt.xticks(rotation=45)
It is observed that the value counts for risk = 0 and risk = 1 are of the same order, so there is no class-imbalance problem.
##fig = plt.figure(figsize=(20,20))
sns.countplot(x='RISK',data=Mainfinal_df[['RISK']],
hue="RISK").set_title(" NO RISK VS RISK")
plt.xticks(rotation=45)
# Regression Relationship
Mainfinal_df[['LOCATION_ID','RISK']] = Mainfinal_df[['LOCATION_ID','RISK']].astype('int')
# 'size' was renamed to 'height' in seaborn 0.9, so pass only height
sns.pairplot(Mainfinal_df, x_vars=['SECTOR_SCORE','LOCATION_ID','PARA_A'], y_vars=["AUDIT_RISK"],
height=4.2, aspect=1, kind="reg", plot_kws={'line_kws':{'color':'red'}})
sns.pairplot(Mainfinal_df, x_vars=['RISK_A', 'PARA_B','SCORE_B'], y_vars=["AUDIT_RISK"],
height=4.2, aspect=1, kind="reg", plot_kws={'line_kws':{'color':'red'}})
sns.pairplot(Mainfinal_df, x_vars=['RISK_B', 'TOTAL', 'NUMBERS'], y_vars=["AUDIT_RISK"],
height=4.2, aspect=1, kind="reg", plot_kws={'line_kws':{'color':'red'}})
sns.pairplot(Mainfinal_df, x_vars=['SCORE_B.1', 'RISK_C','MONEY_VALUE'], y_vars=["AUDIT_RISK"],
height=4.2, aspect=1, kind="reg", plot_kws={'line_kws':{'color':'red'}})
sns.pairplot(Mainfinal_df, x_vars=['SCORE_MV', 'RISK_D', 'DISTRICT_LOSS'], y_vars=["AUDIT_RISK"],
height=4.2, aspect=1, kind="reg", plot_kws={'line_kws':{'color':'red'}})
sns.pairplot(Mainfinal_df, x_vars=['PROB1', 'RISK_E','HISTORY'], y_vars=["AUDIT_RISK"],
height=4.2, aspect=1, kind="reg", plot_kws={'line_kws':{'color':'red'}})
sns.pairplot(Mainfinal_df, x_vars=['PROB', 'RISK_F', 'SCORE'], y_vars=["AUDIT_RISK"],
height=4.2, aspect=1, kind="reg", plot_kws={'line_kws':{'color':'red'}})
sns.pairplot(Mainfinal_df, x_vars=['INHERENT_RISK','CONTROL_RISK', 'MARKS'], y_vars=["AUDIT_RISK"],
height=4.2, aspect=1, kind="reg", plot_kws={'line_kws':{'color':'red'}})
sns.pairplot(Mainfinal_df, x_vars=['LOSS'], y_vars=["AUDIT_RISK"],
height=4.2, aspect=1, kind="reg", plot_kws={'line_kws':{'color':'red'}})
sns.pairplot(Mainfinal_df, x_vars=['LOSS_SCORE','HISTORY_SCORE', 'RISK'], y_vars=["AUDIT_RISK"],
height=4.2, aspect=1, kind="reg", plot_kws={'line_kws':{'color':'red'}})
The plots above show a good linear correlation between INHERENT_RISK and AUDIT_RISK when RISK = 1, but a much weaker one when RISK = 0. From these plots, anything above an INHERENT_RISK of 3.5 can be taken as high risk (RISK = 1). The distribution is also different for risk = 0 and risk = 1.
The features are split into scale_x_df and the targets y_regFinal (regression) and y_clfFinal (classification). Feature scaling is performed with both MinMaxScaler and StandardScaler.
from sklearn.preprocessing import MinMaxScaler, StandardScaler
Mainfinal_df1 = Mainfinal_df.copy()
mm_scaler = MinMaxScaler()
std_scaler = StandardScaler()
y_regFinal = Mainfinal_df['AUDIT_RISK']# Regression y
y_clfFinal = Mainfinal_df['RISK'] # Classification y
scale_x_df = Mainfinal_df1.drop(["AUDIT_RISK","RISK"], axis =1)
mm_x_df = scale_x_df.copy()
std_x_df = scale_x_df.copy()
num_cols = ['SECTOR_SCORE', 'LOCATION_ID','PARA_A', 'SCORE_A', 'RISK_A', 'PARA_B',
'SCORE_B', 'RISK_B', 'TOTAL', 'NUMBERS', 'SCORE_B.1', 'RISK_C',
'MONEY_VALUE', 'SCORE_MV', 'RISK_D', 'DISTRICT_LOSS', 'PROB1', 'RISK_E',
'HISTORY','RISK_F', 'SCORE', 'INHERENT_RISK', 'CONTROL_RISK',
'MARKS', 'LOSS','PROB', 'LOSS_SCORE', 'HISTORY_SCORE']
num_cols = [x.upper() for x in num_cols]
mm_x_df[num_cols] = mm_scaler.fit_transform(mm_x_df[num_cols]) # MinMax scaled X
std_x_df[num_cols] = std_scaler.fit_transform(std_x_df[num_cols]) # Std scaled X
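What the two scalers do can be reproduced by hand on a toy column (this mirrors MinMaxScaler and StandardScaler, which uses the population standard deviation):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])

# Min-max scaling maps the column onto [0, 1] ...
mm = (x - x.min()) / (x.max() - x.min())
# ... while standardization recenters to mean 0 and unit variance.
std = (x - x.mean()) / x.std()
print(mm, std)
```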
X=mm_x_df[num_cols]
y=y_regFinal
X.columns
X.shape
y.shape
Visualizing the feature relationships in the data.
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.plotting import scatter_matrix
attributes = Mainfinal_df.columns[:6]
scatter_matrix(X[attributes], figsize = (15,15), c = y, alpha = 0.8, marker = 'o')  # lowercase 'o' is the circle marker
# heatmap
# Correlation matrix - linear relation among independent attributes and with the Target attribute
plt.figure(figsize = (25,25))
sns.heatmap(Mainfinal_df.corr(), square = True, linecolor = 'red', annot = True)
Mainfinal_df.shape
The data scaling and cleaning are done; now the regression models are applied to the data.
Splitting the data into training, validation and test sets.
from sklearn.model_selection import train_test_split
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, random_state=0)
# split train+validation set into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X_trainval, y_trainval, random_state=1)
print("Size of training set: {} size of validation set: {} size of test set:"
" {}\n".format(X_train.shape[0], X_valid.shape[0], X_test.shape[0]))
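With train_test_split's default test_size of 0.25, the two chained splits leave roughly 56% of the rows for training, 19% for validation and 25% for testing; a quick arithmetic sketch (592 is an illustrative row count, not the exact merged size):

```python
# Chained 75/25 splits: test set first, then validation from the remainder.
n = 592                    # illustrative number of observations
n_test = round(n * 0.25)   # first split holds out 25% for the test set
n_trainval = n - n_test
n_valid = round(n_trainval * 0.25)
n_train = n_trainval - n_valid
print(n_train, n_valid, n_test)
```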
The model below is a linear regression; the target variable for this regression is AUDIT_RISK.
The data is divided into training, validation and test sets, which helps in overcoming overfitting.
The model gives training (training + validation) and test scores of 0.8344 and 0.8354; the two are close, with only a small difference, which indicates a good model.
The root mean squared error (RMSE) measures the typical size of the prediction error; the MSE and RMSE are below 1 for the linear regression model.
%matplotlib inline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, random_state=0)
# split train+validation set into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X_trainval, y_trainval, random_state=1)
print("Size of training set: {} size of validation set: {} size of test set:"
" {}\n".format(X_train.shape[0], X_valid.shape[0], X_test.shape[0]))
from sklearn.linear_model import LinearRegression
lreg = LinearRegression()
lreg.fit(X_trainval, y_trainval)
print('Train score: %.4f'%lreg.score(X_trainval, y_trainval))
print('Test score: %.4f'%lreg.score(X_test, y_test))
# The coefficients
#print('Coefficients: \n', lreg.coef_)
predictions = lreg.predict(X_test)
#plot 1: predicted vs actual
plt.scatter(y_test, predictions)
#plot 2: residual
plt.scatter(predictions, predictions - y_test, c = 'b')
plt.ylabel('Residuals')
plt.title('Residual plot of the test data')
plt.show()
# calculate these metrics by hand!
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
mse=metrics.mean_squared_error(y_test, predictions)
rmse = np.sqrt(metrics.mean_squared_error(y_test, predictions))
sns.histplot(y_test - predictions, bins=500, kde=True)  # distplot is deprecated in recent seaborn
coefficients = pd.DataFrame(lreg.coef_, X.columns)
coefficients.columns = ['Coefficient']
coefficients
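As the "calculate these metrics by hand" comment above suggests, the three metrics follow straight from their definitions; a toy check:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.5, 5.0, 4.0])

err = y_pred - y_true
mae = np.abs(err).mean()   # mean absolute error
mse = (err ** 2).mean()    # mean squared error
rmse = np.sqrt(mse)        # root mean squared error
print(mae, mse, rmse)
```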
f = open('regressionOutput_GROUP19.csv', 'w')
line = 'MODEL_NAME, TRAIN_SCORE, TEST_SCORE, MAE/BEST PARAMS, MSE, RMSE,\n'
f.write(line)
f.close() # close the file so the header row is flushed to disk
# append each model's results to the CSV file created above
line = 'LinearRegression' + ',' +str(lreg.score(X_trainval, y_trainval))+','+str(lreg.score(X_test, y_test))+','+str(metrics.mean_absolute_error(y_test, predictions))+','+str(mse)+','+str(rmse) +'\n'
f = open('regressionOutput_GROUP19.csv','a')
f.write(line)
f.close()
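Building the result rows by string concatenation works but is fragile (a comma inside a field breaks the columns); the csv module handles quoting automatically. A sketch using an in-memory buffer with the scores reported above and placeholder metric values (swap in `open('regressionOutput_GROUP19.csv', 'a', newline='')` to write the real file):

```python
import csv
import io

# In-memory buffer stands in for the CSV file so the sketch is file-free.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["MODEL_NAME", "TRAIN_SCORE", "TEST_SCORE", "MSE", "RMSE"])
writer.writerow(["LinearRegression", 0.8344, 0.8354, 1.0, 1.0])
print(buf.getvalue())
```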
K-nearest neighbors regression is performed on the data. GridSearchCV is used to find the best number of neighbors; the search returns n_neighbors = 3 as the best parameter. The train and test scores for this model are:
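What KNeighborsRegressor does with k = 3 can be sketched by hand in one dimension (toy data): the prediction is the mean target of the three nearest training points.

```python
import numpy as np

# Tiny 1-D training set (illustrative values).
X_train = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
y_train = np.array([0.0, 1.0, 2.0, 3.0, 10.0])

def knn_predict(x, k=3):
    # Prediction = mean target of the k nearest training points.
    nearest = np.argsort(np.abs(X_train - x))[:k]
    return y_train[nearest].mean()

print(knn_predict(1.6))
```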
from sklearn.neighbors import KNeighborsRegressor
%matplotlib inline
train_score_array = []
test_score_array = []
for k in range(1,10):
    knn_reg = KNeighborsRegressor(k)
    knn_reg.fit(X_trainval, y_trainval)
    train_score_array.append(knn_reg.score(X_trainval, y_trainval))
    test_score_array.append(knn_reg.score(X_test, y_test))
x_axis = range(1,10)
plt.plot(x_axis, train_score_array, c = 'g', label = 'Train Score')
plt.plot(x_axis, test_score_array, c = 'b', label = 'Test Score')
plt.legend()
plt.xlabel('k')
plt.ylabel('score')  # .score() returns R^2, not MSE
# split data into train+validation set and test set
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
# split data into train+validation set and test set
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, random_state=0)
# split train+validation set into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X_trainval, y_trainval, random_state=1)
print("Size of training set: {} size of validation set: {} size of test set:"
" {}\n".format(X_train.shape[0], X_valid.shape[0], X_test.shape[0]))
best_score = 0
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor()
from sklearn.model_selection import GridSearchCV
#param_grid = dict(k_range' : [1,3,5,7,9,12,15,17,20])
k_range = [1,3,5,7,9,12,15,17,20]
weights_range = ['uniform','distance']
param_grid = dict(n_neighbors=k_range, weights = weights_range)
#grid_search = GridSearchCV(knn, param_grid, cv=10, return_train_score=True)
grid_search = GridSearchCV(knn, param_grid, cv=10, return_train_score=True)
grid_search.fit(X_trainval, y_trainval)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))
%matplotlib inline
train_score_array = []
test_score_array = []
knn_reg = KNeighborsRegressor(3)
knn_reg.fit(X_trainval, y_trainval)
train_score_array.append(knn_reg.score(X_trainval, y_trainval))
test_score_array.append(knn_reg.score(X_test, y_test))
print(train_score_array)
print(test_score_array)
from sklearn import metrics
knn_tr_pred = knn_reg.predict(X_trainval)
knn_test_pred = knn_reg.predict(X_test)
knn_tr_mse = metrics.mean_squared_error(y_trainval, knn_tr_pred)
knn_tr_rmse = np.sqrt(knn_tr_mse)
knn_test_mse = metrics.mean_squared_error(y_test, knn_test_pred)
knn_test_rmse = np.sqrt(knn_test_mse)
print('train mse: ', knn_tr_mse)
print('train rmse: ', knn_tr_rmse)
print('test mse: ', knn_test_mse)
print('test rmse: ', knn_test_rmse)
print('\ntrain score: ', knn_reg.score(X_trainval, y_trainval))
print('test score: ', knn_reg.score(X_test, y_test) )
f = open('regressionOutput_GROUP19.csv', 'a')
line = 'KNeighborsRegressor,' + str(knn_reg.score(X_trainval, y_trainval)) + ',' + str(knn_reg.score(X_test, y_test)) +','+str('bestparam n = 3')+','+str(knn_test_mse)+','+str(knn_test_rmse)+ '\n'
f.write(line)
f.close()
Ridge Regression: performs L2 regularization, i.e. it adds a penalty equivalent to the square of the magnitude of the coefficients. It takes ‘alpha’ as a parameter on initialization; high alpha values can lead to significant underfitting. After performing grid-search cross-validation, the parameter alpha = 0.001 is selected.
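The L2 penalty can be made concrete with the ridge closed form w = (XᵀX + αI)⁻¹Xᵀy, sketched here on synthetic data (all values illustrative): increasing alpha shrinks the coefficient norm, which is where the underfitting risk comes from.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)

def ridge_fit(X, y, alpha):
    # Closed-form ridge solution: w = (X^T X + alpha * I)^-1 X^T y
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

w_small = ridge_fit(X, y, alpha=0.001)   # barely regularized, close to OLS
w_large = ridge_fit(X, y, alpha=100.0)   # heavily shrunk coefficients
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```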
from sklearn.linear_model import Ridge
x_range = [0.01, 0.1, 1, 10, 100]
train_score_list = []
test_score_list = []
for alpha in x_range:
    ridge = Ridge(alpha)
    ridge.fit(X_trainval,y_trainval)
    train_score_list.append(ridge.score(X_trainval,y_trainval))
    test_score_list.append(ridge.score(X_test, y_test))
print(train_score_list)
print(test_score_list)
%matplotlib inline
import numpy as np
x_range1 = np.linspace(0.001, 1, 100).reshape(-1,1)
x_range2 = np.linspace(1, 10000, 10000).reshape(-1,1)
x_range = np.append(x_range1, x_range2)
coeff = []
for alpha in x_range:
    ridge = Ridge(alpha)
    ridge.fit(X_trainval,y_trainval)
    coeff.append(ridge.coef_)
coeff = np.array(coeff)
for i in range(0,26):
    plt.plot(x_range, coeff[:,i], label = 'feature {:d}'.format(i))
plt.axhline(y=0, xmin=0.001, xmax=9999, linewidth=1, c ='gray')
plt.xlabel(r'$\alpha$')
plt.xscale('log')
plt.legend(loc='upper center', bbox_to_anchor=(0.5, 1.5),
ncol=3, fancybox=True, shadow=True)
plt.show()
from sklearn.linear_model import Ridge
import numpy as np
for alpha in [0.001, 0.01, 0.1, 1, 10, 100]:
    ridge = Ridge(alpha=alpha)  # fit with the candidate alpha, not the default
    scores = cross_val_score(ridge, X_trainval, y_trainval, cv=5)
    score = np.mean(scores)
    if score > best_score:
        best_score = score
        best_parameters = {'alpha': alpha}
ridge = Ridge(**best_parameters)
ridge.fit(X_trainval, y_trainval)
test_score = ridge.score(X_test, y_test)
print("Best score on validation set: {:.2f}".format(best_score))
print("Best parameters: ", best_parameters)
print("Test set score with best parameters: {:.2f}".format(test_score))
ridge = Ridge(alpha = 0.001)
ridge.fit(X_trainval,y_trainval)
print('Train score: {:.4f}'.format(ridge.score(X_trainval,y_trainval)))
print('Test score: {:.4f}'.format(ridge.score(X_test, y_test)))
from sklearn import metrics
ridge_tr_pred = ridge.predict(X_trainval)
ridge_test_pred =ridge.predict(X_test)
ridge_tr_mse = metrics.mean_squared_error(y_trainval,ridge_tr_pred)
ridge_tr_rmse = np.sqrt(ridge_tr_mse)
ridge_test_mse = metrics.mean_squared_error(y_test, ridge_test_pred)
ridge_test_rmse = np.sqrt(ridge_test_mse)
print('train mse: ', ridge_tr_mse)
print('train rmse: ', ridge_tr_rmse)
print('test mse: ', ridge_test_mse)
print('test rmse: ', ridge_test_rmse)
f = open('regressionOutput_GROUP19.csv', 'a')
line = 'Ridge,' + str(ridge.score(X_trainval,y_trainval)) + ',' + str(ridge.score(X_test, y_test)) +','+str('alpha = 0.001')+','+str(ridge_test_mse)+','+str(ridge_test_rmse)+ '\n'
f.write(line)
f.close()
from sklearn.linear_model import Lasso
x_range = [0.01, 0.1, 1, 10, 100]
train_score_list = []
test_score_list = []
for alpha in x_range:
    lasso = Lasso(alpha)
    lasso.fit(X_trainval,y_trainval)
    train_score_list.append(lasso.score(X_trainval,y_trainval))
    test_score_list.append(lasso.score(X_test, y_test))
plt.plot(x_range, train_score_list, c = 'g', label = 'Train Score')
plt.plot(x_range, test_score_list, c = 'b', label = 'Test Score')
plt.xscale('log')
plt.legend(loc = 3)
plt.xlabel(r'$\alpha$')
%matplotlib inline
x_range1 = np.linspace(0.001, 1, 1000).reshape(-1,1)
x_range2 = np.linspace(1, 1000, 1000).reshape(-1,1)
x_range = np.append(x_range1, x_range2)
coeff = []
for alpha in x_range:
    lasso = Lasso(alpha)
    lasso.fit(X_trainval,y_trainval)
    coeff.append(lasso.coef_)
coeff = np.array(coeff)
for i in range(0,13):
    plt.plot(x_range, coeff[:,i], label = 'feature {:d}'.format(i))
plt.axhline(y=0, xmin=0.001, xmax=9999, linewidth=1, c ='gray')
plt.xlabel(r'$\alpha$')
plt.xscale('log')
plt.legend(loc='upper center', bbox_to_anchor=(0.5, 1.5),
ncol=3, fancybox=True, shadow=True)
plt.show()
from sklearn.linear_model import Lasso
import numpy as np
for alpha in [0.001, 0.01, 0.1, 1, 10, 100]:
    lasso = Lasso(alpha=alpha)  # fit with the candidate alpha, not the default
    # perform cross-validation on the lasso model
    scores = cross_val_score(lasso, X_trainval, y_trainval, cv=5)
    # compute mean cross-validation accuracy
    score = np.mean(scores)
    # if we got a better score, store the score and parameters
    if score > best_score:
        best_score = score
        best_parameters = {'alpha': alpha}
# rebuild a model on the combined training and validation set
lasso = Lasso(**best_parameters)
lasso.fit(X_trainval, y_trainval)
test_score = lasso.score(X_test, y_test)
print("Best score on validation set: {:.2f}".format(best_score))
print("Best parameters: ", best_parameters)
print("Test set score with best parameters: {:.2f}".format(test_score))
lasso = Lasso(alpha = 0.001)
lasso.fit(X_trainval,y_trainval)
print('Train score: {:.4f}'.format(lasso.score(X_trainval,y_trainval)))
print('Test score: {:.4f}'.format(lasso.score(X_test, y_test)))
from sklearn.linear_model import Lasso
train_score_list = []
test_score_list = []
lasso = Lasso(alpha=0.001)
lasso.fit(X,y)
test_score_list.append(lasso.score(X, y))
print(test_score_list)
# best param
from sklearn import metrics
lasso_tr_pred = lasso.predict(X_trainval)
lasso_test_pred =lasso.predict(X_test)
lasso_tr_mse = metrics.mean_squared_error(y_trainval,lasso_tr_pred)
lasso_tr_rmse = np.sqrt(lasso_tr_mse)
lasso_test_mse = metrics.mean_squared_error(y_test, lasso_test_pred)
lasso_test_rmse = np.sqrt(lasso_test_mse)
print('train mse: ', lasso_tr_mse)
print('train rmse: ', lasso_tr_rmse)
print('test mse: ', lasso_test_mse)
print('test rmse: ', lasso_test_rmse)
f = open('regressionOutput_GROUP19.csv', 'a')
line = 'Lasso,' + str(lasso.score(X_trainval,y_trainval)) + ',' + str(test_score_list) +','+str('alpha=0.001')+','+str(lasso_test_mse)+','+str(lasso_test_rmse)+ '\n'
f.write(line)
f.close()
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
lreg = LinearRegression()
train_score_list = []
test_score_list = []
for n in range(1,3):
    poly = PolynomialFeatures(n)
    X_train_poly = poly.fit_transform(X_trainval)
    X_test_poly = poly.transform(X_test)
    lreg.fit(X_train_poly, y_trainval)
    train_score_list.append(lreg.score(X_train_poly, y_trainval))
    test_score_list.append(lreg.score(X_test_poly, y_test))
    trainScore = lreg.score(X_train_poly, y_trainval)
    testScore = lreg.score(X_test_poly, y_test)
%matplotlib inline
x_axis = range(1,3)
plt.plot(x_axis, train_score_list, c = 'g', label = 'Train Score')
plt.plot(x_axis, test_score_list, c = 'b', label = 'Test Score')
plt.xlabel('degree')
plt.ylabel('accuracy')
plt.legend()
print(train_score_list)
print(test_score_list)
f = open('regressionOutput_GROUP19.csv', 'a')
line = 'Poly,' + str(trainScore) + ',' + str(testScore) +','+str('degree 2')+'\n'  # range(1,3) stops at degree 2
f.write(line)
f.close()
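What PolynomialFeatures does for a single column can be written out by hand; a degree-2 sketch on two toy values:

```python
import numpy as np

x = np.array([2.0, 3.0])

# Degree-2 expansion of one feature: bias, x, x^2 — the columns
# PolynomialFeatures(2) would generate for a single input column.
expanded = np.column_stack([np.ones_like(x), x, x ** 2])
print(expanded)
```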
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
import os
import pandas as pd
import mglearn
DT_r = DecisionTreeRegressor()
DT_r.fit(X_trainval,y_trainval)
DT_tr_pred = DT_r.predict(X_trainval)
DT_test_pred = DT_r.predict(X_test)
lreg = LinearRegression().fit(X_trainval, y_trainval)
pred_lr = lreg.predict(X_trainval)
pred_test =lreg.predict(X_test)
print('Train Score:',DT_r.score(X_trainval,y_trainval))
print('Test Score:',DT_r.score(X_test, y_test))
##train_score_list.append(lreg.score(pred_lr, y_trainval))
##test_score_list.append(lreg.score(pred_test, y_test))
##print(train_score_list)
##print(test_score_list)
pred_lr
pred_test
from sklearn import metrics
# use the decision-tree predictions so these metrics match the DT row written to the CSV below
pred_lr_mse = metrics.mean_squared_error(y_trainval, DT_tr_pred)
pred_lr_rmse = np.sqrt(pred_lr_mse)
pred_test_mse = metrics.mean_squared_error(y_test, DT_test_pred)
pred_test_rmse = np.sqrt(pred_test_mse)
print('train mse: ', pred_lr_mse)
print('train rmse: ', pred_lr_rmse)
print('test mse: ', pred_test_mse)
print('test rmse: ', pred_test_rmse)
f = open('regressionOutput_GROUP19.csv', 'a')
line = 'DT regressor,' + str(DT_r.score(X_trainval,y_trainval))+ ',' +str(DT_r.score(X_test, y_test))+','+str('____')+','+str(pred_test_mse)+','+str(pred_test_rmse)+','+str('best Cross-ValidScore =0.79')+'\n'
f.write(line)
f.close()
from sklearn.model_selection import GridSearchCV
#param_grid = dict(k_range' : [1,3,5,7,9,12,15,17,20])
from sklearn import svm
from sklearn.svm import SVR
import numpy as np
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}
svm_r = svm.SVR()
grid_search = GridSearchCV(svm_r, param_grid, cv=10, return_train_score=True)
grid_search.fit(X_trainval, y_trainval)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))
# SVM Linear
#X_trainval = np.array(X_trainval)
#y_trainval = np.array(y_trainval)
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn import svm
svm_r = svm.SVR(kernel='linear', C = 100)
svm_r.fit(X_trainval, y_trainval)
svmr_tr_pred = svm_r.predict(X_trainval)
svmr_test_pred = svm_r.predict(X_test)
print('Train Score for Linear Kernel:',svm_r.score(X_trainval,y_trainval))
print('Test Score for Linear Kernel:',svm_r.score(X_test, y_test))
svm_tr_mse = metrics.mean_squared_error(y_trainval, svmr_tr_pred)
svm_tr_rmse = np.sqrt(svm_tr_mse)
svm_test_mse = metrics.mean_squared_error(y_test, svmr_test_pred)
svm_test_rmse = np.sqrt(svm_test_mse)
print('\ntrain mse: ', svm_tr_mse)
print('train rmse: ', svm_tr_rmse)
print('test mse: ', svm_test_mse)
print('test rmse: ', svm_test_rmse)
#y_pred = svclassifier.predict(X_test)
#from sklearn.metrics import classification_report, confusion_matrix
#print(confusion_matrix(y_test,y_pred))
#print(classification_report(y_test,y_pred))
f = open('regressionOutput_GROUP19.csv', 'a')
line = 'SVM(SVR-Linear),' + str(svm_r.score(X_trainval,y_trainval)) + ',' + str(svm_r.score(X_test, y_test))+','+str('C = 100')+','+str(svm_test_mse)+','+str(svm_test_rmse)+','+str('best Cross Valid Score =0.79') +'\n'
f.write(line)
f.close()
# SVM RBF
svm_r = svm.SVR(kernel='rbf', C = 100)
svm_r.fit(X_trainval, y_trainval)
svmr_tr_pred = svm_r.predict(X_trainval)
svmr_test_pred = svm_r.predict(X_test)
print('Train Score for RBF kernel:',svm_r.score(X_trainval,y_trainval))
print('Test Score for RBF kernel:',svm_r.score(X_test, y_test))
svm_tr_mse = metrics.mean_squared_error(y_trainval, svmr_tr_pred)
svm_tr_rmse = np.sqrt(svm_tr_mse)
svm_test_mse = metrics.mean_squared_error(y_test, svmr_test_pred)
svm_test_rmse = np.sqrt(svm_test_mse)
print('\ntrain mse: ', svm_tr_mse)
print('train rmse: ', svm_tr_rmse)
print('test mse: ', svm_test_mse)
print('test rmse: ', svm_test_rmse)
f = open('regressionOutput_GROUP19.csv', 'a')
line = 'SVM(SVR-RBF),' + str(svm_r.score(X_trainval,y_trainval)) + ',' + str(svm_r.score(X_test, y_test))+','+str('C = 100')+','+str(svm_test_mse)+','+str(svm_test_rmse)+','+str('best Cross Valid Score =0.79') +'\n'
f.write(line)
f.close()
## SVM Regressor - poly Kernel vs Linear
svm_r = svm.SVR(kernel='poly', C = 100, degree=3)
svm_r.fit(X_trainval, y_trainval)
svmr_tr_pred = svm_r.predict(X_trainval)
svmr_test_pred = svm_r.predict(X_test)
svm_tr_mse = metrics.mean_squared_error(y_trainval, svmr_tr_pred)
svm_tr_rmse = np.sqrt(svm_tr_mse)
svm_test_mse = metrics.mean_squared_error(y_test, svmr_test_pred)
svm_test_rmse = np.sqrt(svm_test_mse)
print('Train Score for Poly Kernel:',svm_r.score(X_trainval,y_trainval))
print('Test Score for Poly Kernel:',svm_r.score(X_test, y_test))
print('\ntrain mse: ', svm_tr_mse)
print('train rmse: ', svm_tr_rmse)
print('test mse: ', svm_test_mse)
print('test rmse: ', svm_test_rmse)
f = open('regressionOutput_GROUP19.csv', 'a')
line = 'SVM(SVR-poly),' + str(svm_r.score(X_trainval,y_trainval)) + ',' + str(svm_r.score(X_test, y_test))+','+str('C = 100')+','+str(svm_test_mse)+','+str(svm_test_rmse)+','+str('best Cross Valid Score =0.79') +'\n'
f.write(line)
f.close()
from sklearn.ensemble import RandomForestRegressor
estimator = [20,50,70]
max_features_val= [10,15,20]
param_grid = dict(n_estimators=estimator, max_features=max_features_val)
print(param_grid)
rf_r = RandomForestRegressor()
rfgs = GridSearchCV(rf_r, param_grid = param_grid, cv=10, scoring='r2')
rfgs.fit(X,y)
print("Best Random Forest score:",rfgs.best_score_)
print("Best Random Forest params:",rfgs.best_params_)
rf_r_best = RandomForestRegressor(n_estimators= 20,max_features= 20 )
rf_r_best.fit(X_trainval,y_trainval)
print("train score for Random Forest Reg:",rf_r_best.score(X_trainval,y_trainval))
print("test score for Random Forest Reg:",rf_r_best.score(X_test,y_test))
rf_train_pred = rf_r_best.predict(X_trainval)
rf_test_pred = rf_r_best.predict(X_test)
#rf_tr_mse = metrics.mean_squared_error(y_trainval, rf_train_pred)
#rf_tr_rmse = np.sqrt(rf_tr_mse)
#rf_test_mse = metrics.mean_squared_error(y_test, rf_test_pred)
#rf_test_rmse = np.sqrt(rf_test_mse)
#print('\ntrain mse: ', rf_tr_mse)
#print('train rmse: ', rf_tr_rmse)
#print('test mse: ', rf_test_mse)
#print('test rmse: ', rf_test_rmse)
features = rf_r_best.feature_importances_
cols = X_trainval.columns
feat_cols = pd.DataFrame({'feature': cols, 'importance': features})
feat_cols
f = open('regressionOutput_GROUP19.csv', 'a')
line = 'Random Forest Reg,' + str(rf_r_best.score(X_trainval,y_trainval)) + ',' + str(rf_r_best.score(X_test,y_test))+','+str('n_estimators= 20,max_features= 20')+'\n'
f.write(line)
f.close()
L= ['SECTOR_SCORE', 'LOCATION_ID', 'PARA_A', 'SCORE_A', 'PARA_B', 'SCORE_B', 'TOTAL', 'NUMBERS', 'MONEY_VALUE', 'HISTORY','SCORE', 'RISK']
auditRisk_data = data_auditRisk.merge(data_trial, on=L)
auditRisk_data['RISK'].unique()
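A quick sanity check for this kind of key-based merge is to pass `indicator=True` and confirm every row matched on both sides. A minimal sketch on toy frames (the column names here are illustrative, not the full audit key list):

```python
import pandas as pd

# Toy stand-ins for the two CSVs, sharing the key column 'SCORE'
left = pd.DataFrame({'SCORE': [1, 2, 3], 'PARA_A': [0.1, 0.2, 0.3]})
right = pd.DataFrame({'SCORE': [1, 2, 3], 'MARKS': [2, 2, 4]})

# indicator=True adds a '_merge' column recording which side each row came
# from, which makes unmatched keys easy to spot after the merge
merged = left.merge(right, on='SCORE', how='outer', indicator=True)
print((merged['_merge'] == 'both').all())  # True when every key matched
```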
auditRisk_data = auditRisk_data.drop(["MONEY_MARKS","DISTRICT"], axis=1)
auditRisk_data.info()
auditRisk_data['MONEY_VALUE'] = auditRisk_data["MONEY_VALUE"].fillna(auditRisk_data["MONEY_VALUE"].mean())
# merged_data_sans_dup = merged_data_sans_dup["Money_Value"].fillna(merged_data_sans_dup["Money_Value"].median())
auditRisk_data.isnull().sum()
auditRisk_data.isna().any()
auditRisk_data["LOCATION_ID"] = auditRisk_data["LOCATION_ID"].replace({"LOHARU": 45, "NUH": 46, "SAFIDON": 47})
auditRisk_data["LOCATION_ID"].unique()
data_out = auditRisk_data[auditRisk_data.PARA_B != 1264.630000]
data_out.shape
data_out[['MONEY_VALUE','RISK_D']].describe()
data_out[(data_out['INHERENT_RISK'] == 622.838000) | (data_out['TOTAL'] == 191.360000) | (data_out['MONEY_VALUE'] == 935.030000) |(data_out['RISK_D'] == 561.018000)]
MergedFinal_df = data_out[(data_out['INHERENT_RISK'] != 622.838000) & (data_out['TOTAL'] != 191.360000) & (data_out['MONEY_VALUE'] != 935.030000) & (data_out['RISK_D'] != 561.018000)]
MergedFinal_df.shape
MergedFinal_df.isnull().any()
MergedFinal_df['RISK'].unique()
MergedFinal_df.columns
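The cutoffs above drop specific observed outliers by their exact values. A more general alternative (a sketch only, not what this notebook does) is an IQR-based filter, shown here on a toy column:

```python
import pandas as pd

def iqr_filter(df, col, k=1.5):
    """Keep rows whose col value lies within k*IQR of the quartiles."""
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return df[df[col].between(lo, hi)]

# Toy data with one extreme value, mimicking the PARA_B outlier
toy = pd.DataFrame({'PARA_B': [1.0, 1.2, 0.9, 1.1, 1264.63]})
print(iqr_filter(toy, 'PARA_B'))  # the 1264.63 row is dropped
```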
from sklearn.preprocessing import MinMaxScaler, StandardScaler
Audit_risk = MergedFinal_df.copy()
mm_scaler = MinMaxScaler()
std_scaler = StandardScaler()
y_final_reg = MergedFinal_df['AUDIT_RISK']  # regression target
y_final_clf = MergedFinal_df['RISK']        # classification target
to_scale_x_df = Audit_risk.drop(["AUDIT_RISK","RISK"], axis =1)
mm_x_df = to_scale_x_df.copy()
std_x_df = to_scale_x_df.copy()
num_cols = ['SECTOR_SCORE', 'LOCATION_ID','PARA_A', 'SCORE_A', 'RISK_A', 'PARA_B',
'SCORE_B', 'RISK_B', 'TOTAL', 'NUMBERS', 'SCORE_B.1', 'RISK_C',
'MONEY_VALUE', 'SCORE_MV', 'RISK_D', 'DISTRICT_LOSS', 'PROB1', 'RISK_E',
'HISTORY', 'PROB', 'RISK_F', 'SCORE', 'INHERENT_RISK', 'CONTROL_RISK',
'MARKS', 'LOSS', 'LOSS_SCORE', 'HISTORY_SCORE']
num_cols = [x.upper() for x in num_cols]
mm_x_df[num_cols] = mm_scaler.fit_transform(mm_x_df[num_cols]) # MinMax scaled X
std_x_df[num_cols] = std_scaler.fit_transform(std_x_df[num_cols]) # Std scaled X
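For reference, the two scalers transform the data differently, which a tiny example makes concrete (toy values, not the audit columns):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0]])

mm = MinMaxScaler().fit_transform(x)      # maps min -> 0, max -> 1
std = StandardScaler().fit_transform(x)   # zero mean, unit variance

print(mm.ravel())              # [0.  0.333... 0.666... 1.]
print(std.mean(), std.std())   # ~0.0, ~1.0
```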
X=std_x_df[num_cols]
y=y_final_clf
X.columns
X.shape
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
# split data into train+validation set and test set
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, random_state=0)
# split train+validation set into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X_trainval, y_trainval, random_state=1)
print("Size of training set: {} size of validation set: {} size of test set:"
" {}\n".format(X_train.shape[0], X_valid.shape[0], X_test.shape[0]))
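The two-stage split above uses the default `test_size=0.25` at each stage, so the resulting proportions can be checked on a toy array (the sizes below are for a hypothetical 100-row input):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.arange(100)

# First split holds out the test set; second carves validation out of the
# remainder. With the default test_size=0.25 this gives roughly 56/19/25.
Xtv, Xte, ytv, yte = train_test_split(X_demo, y_demo, random_state=0)
Xtr, Xva, ytr, yva = train_test_split(Xtv, ytv, random_state=1)
print(len(Xtr), len(Xva), len(Xte))
```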
best_score = 0
from sklearn.neighbors import KNeighborsClassifier
train_score_array = []
test_score_array = []
for k in range(1,20):
    knn = KNeighborsClassifier(k)
    knn.fit(X_trainval, y_trainval)
    train_score_array.append(knn.score(X_trainval, y_trainval))
    test_score_array.append(knn.score(X_test, y_test))
x_axis = range(1,20)
%matplotlib inline
plt.plot(x_axis, train_score_array, label = 'Train Score', c = 'g')
plt.plot(x_axis, test_score_array, label = 'Test Score', c='b')
plt.xlabel('k')
plt.ylabel('Accuracy')
plt.legend()
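Picking the best k from such a sweep can also be done programmatically (a sketch with made-up score arrays; in practice the selection should be based on validation or cross-validation scores, as GridSearchCV does, rather than on the test set):

```python
# Hypothetical test-score array from a k = 1..19 sweep
test_score_array = [0.90, 0.95, 0.97, 0.96, 0.94] + [0.93] * 14
ks = range(1, 20)

# Pick the k whose score is highest
best_k = max(ks, key=lambda k: test_score_array[k - 1])
print(best_k)  # -> 3, the k with the highest score in this toy sweep
```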
The model below is the kNN classifier. GridSearchCV is used to derive the best parameters, i.e. n_neighbors and weights. The best parameter found for the kNN classification is n_neighbors=1, which leaves the model with an accuracy and precision of 1, indicating a perfect fit on this split with no misclassifications.
from sklearn.model_selection import cross_val_score , GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
#param_grid = dict(k_range' : [1,3,5,7,9,12,15,17,20])
k_range = [1,3,5,7,9,12,15,17,20]
weights_range = ['uniform','distance']
param_grid = dict(n_neighbors=k_range, weights = weights_range)
grid_search = GridSearchCV(knn, param_grid, cv=10, return_train_score=True)
grid_search.fit(X_trainval, y_trainval)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))
from sklearn.neighbors import KNeighborsClassifier
train_score_array = []
test_score_array = []
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
train_score_array.append(knn.score(X_train, y_train))
test_score_array.append(knn.score(X_test, y_test))
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, cohen_kappa_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
knn_c_bst_clf = KNeighborsClassifier(n_neighbors=1)
knn_c_bst_clf.fit(X_trainval,y_trainval)
knnc_tr_pred = knn_c_bst_clf.predict(X_trainval)
knnc_test_pred = knn_c_bst_clf.predict(X_test)
print(knnc_tr_pred[4])
print("Train data")
print("Accuracy score: ", accuracy_score(y_trainval, knnc_tr_pred))
print("f1 score: ", f1_score(y_trainval, knnc_tr_pred))
print("recall score: ", recall_score(y_trainval, knnc_tr_pred))
print("precision: ", precision_score(y_trainval, knnc_tr_pred))
print(" ")
print("Test data")
print("Accuracy score: ", accuracy_score(y_test, knnc_test_pred))
print("f1 score: ", f1_score(y_test, knnc_test_pred))
print("recall score: ", recall_score(y_test, knnc_test_pred))
print("precision: ", precision_score(y_test, knnc_test_pred))
confusion = confusion_matrix(y_test, knnc_test_pred)
print("Confusion matrix:\n{}".format(confusion))
print(classification_report(y_test, knnc_test_pred))
f = open('classificationOutput_GROUP19.csv','a')
line = 'Model,Best Params,Accuracy Score,F1 Score,Recall,Precision\n'
f.write(line)
f.close()
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, cohen_kappa_score
Accuracy=accuracy_score(y_test, knnc_test_pred)
f1score = f1_score(y_test, knnc_test_pred)
recallscore= recall_score(y_test, knnc_test_pred)
precision = precision_score(y_test, knnc_test_pred)
# adding vals to excel file that you created.
line = 'Knn Classification' + ',' +str("n_neighbors=1")+','+str(Accuracy )+','+str(f1score)+ ','+str(recallscore)+','+str(precision) +'\n'
f = open('classificationOutput_GROUP19.csv','a')
f.write(line)
f.close()
pd.crosstab(y_trainval, knnc_tr_pred)
pd.crosstab(y_test, knnc_test_pred)
from sklearn.metrics import classification_report
report = classification_report(y_test, knnc_test_pred)
print(report)
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from matplotlib import pyplot
# fit a model
knn_c_bst_clf.fit(X_trainval,y_trainval)
# predict probabilities
probs = knn_c_bst_clf.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_test, probs)
print('AUC: %.3f' % auc)
# calculate roc curve
fpr, tpr, thresholds = roc_curve(y_test, probs)
print( thresholds )
# plot no skill
pyplot.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
pyplot.plot(fpr, tpr, marker='.')
# show the plot
pyplot.show()
Here the ROC curve indicates perfect skill on the test split (the curve hugs the top-left corner), consistent with the classification scores above.
from sklearn.metrics import auc
from sklearn.metrics import average_precision_score
from sklearn.metrics import precision_recall_curve
# predict probabilities
probs = knn_c_bst_clf.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# predict class values
yhat = knn_c_bst_clf.predict(X_test)
# calculate precision-recall curve
precision, recall, thresholds = precision_recall_curve(y_test, probs)
# calculate F1 score
f1 = f1_score(y_test, yhat)
# calculate precision-recall AUC (pr_auc avoids shadowing sklearn's auc function)
pr_auc = auc(recall, precision)
# calculate average precision score
ap = average_precision_score(y_test, probs)
print('f1=%.3f auc=%.3f ap=%.3f' % (f1, pr_auc, ap))
# plot no skill
pyplot.plot([0, 1], [0.5, 0.5], linestyle='--')
# plot the precision-recall curve for the model
pyplot.plot(recall, precision, marker='.')
# show the plot
pyplot.show()
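One caveat on the dashed baseline: a horizontal line at 0.5 is only the correct no-skill reference when the classes are balanced. In general, the precision of a no-skill classifier equals the positive-class prevalence, which is easy to compute (a sketch with toy labels):

```python
import numpy as np

y = np.array([0, 0, 0, 1])  # toy labels: 25% positives

# A no-skill classifier predicts at random, so its precision converges to
# the fraction of positives in the data
no_skill = y.mean()
print(no_skill)  # 0.25 -- the horizontal baseline for the PR plot
```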
The model below is logistic regression. GridSearchCV is used to derive the best parameters, i.e. C and penalty. The best parameters found for the logistic regression are C=1 and penalty='l1', which leave the model with an accuracy and precision of 1, indicating a perfect fit on this split with no misclassifications.
from sklearn.linear_model import LogisticRegression
c_range = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
train_score_l1 = []
train_score_l2 = []
test_score_l1 = []
test_score_l2 = []
for c in c_range:
    # liblinear is a solver that supports both the l1 and l2 penalties
    log_l1 = LogisticRegression(penalty='l1', C=c, solver='liblinear')
    log_l2 = LogisticRegression(penalty='l2', C=c, solver='liblinear')
    log_l1.fit(X_trainval, y_trainval)
    log_l2.fit(X_trainval, y_trainval)
    train_score_l1.append(log_l1.score(X_trainval, y_trainval))
    train_score_l2.append(log_l2.score(X_trainval, y_trainval))
    test_score_l1.append(log_l1.score(X_test, y_test))
    test_score_l2.append(log_l2.score(X_test, y_test))
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(c_range, train_score_l1, label = 'Train score, penalty = l1')
plt.plot(c_range, test_score_l1, label = 'Test score, penalty = l1')
plt.plot(c_range, train_score_l2, label = 'Train score, penalty = l2')
plt.plot(c_range, test_score_l2, label = 'Test score, penalty = l2')
plt.legend()
plt.xlabel('Regularization parameter: C')
plt.ylabel('Accuracy')
plt.xscale('log')
from sklearn.linear_model import LogisticRegression
c_range = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
penalty_mod = ['l1','l2']
log_reg = LogisticRegression(solver='liblinear')  # liblinear supports both l1 and l2
#create a parameter grid: map the parameter names to the values that should be searched
param_grid = dict(penalty=penalty_mod,C=c_range)
print(param_grid)
#instantiation of the grid
log_reg_grid = GridSearchCV(log_reg,param_grid, cv=10, scoring='accuracy')
# fitting the grid
log_reg_grid.fit(X, y)
log_reg_grid.best_score_
log_reg_grid.best_params_
scores = cross_val_score(log_reg, X, y,cv=10)
# input arguments followed by X and Y
print("Cross-validation scores: {}".format(scores))
log_reg = LogisticRegression(penalty='l1', C=100, solver='liblinear')
log_reg.fit(X_trainval, y_trainval)
print(log_reg.score(X_trainval, y_trainval))
print(log_reg.score(X_test, y_test))
logreg_tr_pred = log_reg.predict(X_trainval)
logreg_test_pred = log_reg.predict(X_test)
pd.crosstab(y_trainval, logreg_tr_pred)
print(log_reg.score(X_trainval, y_trainval))
pd.crosstab(y_test, logreg_test_pred)
print(log_reg.score(X_test, y_test))
from sklearn.metrics import classification_report
report = classification_report(y_test, logreg_test_pred)
print(report)
from sklearn.metrics import accuracy_score
print("Accuracy score: ", accuracy_score(y_trainval, logreg_tr_pred))
print("f1 score: ", f1_score(y_trainval, logreg_tr_pred))
print("recall score: ", recall_score(y_trainval, logreg_tr_pred))
print("precision: ", precision_score(y_trainval, logreg_tr_pred))
print(" ")
print("Test data")
print("Accuracy score: ", accuracy_score(y_test, logreg_test_pred))
print("f1 score: ", f1_score(y_test, logreg_test_pred))
print("recall score: ", recall_score(y_test, logreg_test_pred))
print("precision: ", precision_score(y_test, logreg_test_pred))
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, cohen_kappa_score
Accuracy=accuracy_score(y_test, logreg_test_pred)
f1score = f1_score(y_test, logreg_test_pred)
recallscore= recall_score(y_test, logreg_test_pred)
precision = precision_score(y_test, logreg_test_pred)
# adding vals to excel file that you created.
line = 'Logistic Regression' + ',' +str("penalty=l1 and C=100")+','+str(Accuracy )+','+str(f1score)+ ','+str(recallscore)+','+str(precision) +'\n'
f = open('classificationOutput_GROUP19.csv','a')
f.write(line)
f.close()
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from matplotlib import pyplot
# predict probabilities
probs = log_reg.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_test, probs)
print('AUC: %.3f' % auc)
# calculate roc curve
fpr, tpr, thresholds = roc_curve(y_test, probs)
print( thresholds )
# plot no skill
pyplot.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
pyplot.plot(fpr, tpr, marker='.')
# show the plot
pyplot.show()
from sklearn.metrics import auc
from sklearn.metrics import average_precision_score
from sklearn.metrics import precision_recall_curve
# predict probabilities
probs = log_reg.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# predict class values
y_prd_class_val = log_reg.predict(X_test)
# calculate precision-recall curve
precision, recall, thresholds = precision_recall_curve(y_test, probs)
# calculate F1 score
f1 = f1_score(y_test, y_prd_class_val)
# calculate precision-recall AUC (pr_auc avoids shadowing sklearn's auc function)
pr_auc = auc(recall, precision)
# calculate average precision score
ap = average_precision_score(y_test, probs)
print('f1=%.3f auc=%.3f ap=%.3f' % (f1, pr_auc, ap))
# plot no skill
pyplot.plot([0, 1], [0.5, 0.5], linestyle='--')
# plot the precision-recall curve for the model
pyplot.plot(recall, precision, marker='.')
# show the plot
pyplot.show()
The model below is the linear SVM (LinearSVC). GridSearchCV is used to derive the best parameter, i.e. C. The best parameter found for the linear SVM is C=10, which leaves the model with an accuracy of about 0.9936 and a precision of 1, i.e. only a few misclassifications.
from sklearn.svm import LinearSVC
c_range= [0.001, 0.01, 0.1, 1, 10, 100]
param_grid = dict(C=c_range)
print("Parameter grid:\n{}".format(param_grid))
clf = LinearSVC()
linearsvc_grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, n_jobs=-1)
linearsvc_grid_search.fit(X, y)
linearsvc_grid_search.best_score_
linearsvc_grid_search.best_params_
clf_best = LinearSVC(C=10)
clf_best.fit(X_trainval, y_trainval)
clf_tr_pred = clf_best.predict(X_trainval)
clf_test_pred = clf_best.predict(X_test)
print("Train data")
print("Accuracy score: ", accuracy_score(y_trainval, clf_tr_pred))
print("f1 score: ", f1_score(y_trainval, clf_tr_pred))
print("recall score: ", recall_score(y_trainval, clf_tr_pred))
print("precision: ", precision_score(y_trainval, clf_tr_pred))
print(" ")
print("Test data")
print("Accuracy score: ", accuracy_score(y_test, clf_test_pred))
print("f1 score: ", f1_score(y_test, clf_test_pred))
print("recall score: ", recall_score(y_test, clf_test_pred))
print("precision: ", precision_score(y_test, clf_test_pred))
pd.crosstab(y_trainval, clf_tr_pred)
pd.crosstab(y_test, clf_test_pred)
report = classification_report(y_test, clf_test_pred)
print(report)
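Note that LinearSVC has no predict_proba, so ROC/PR curves for it must use decision_function, whose signed margin ranks examples the same way a probability would. A sketch on toy data (not the audit features):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
X_toy = rng.randn(40, 2)
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)  # linearly separable labels

clf = LinearSVC(C=10, max_iter=10000).fit(X_toy, y_toy)
# decision_function returns the signed distance to the hyperplane,
# usable wherever a probability score is needed for ranking
scores = clf.decision_function(X_toy)
print(roc_auc_score(y_toy, scores))  # near 1.0 on this separable toy problem
```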
The model below is the SVC with a linear kernel. GridSearchCV is used to derive the best parameter, i.e. C. The best parameter found for the linear-kernel SVC is C=1, which leaves the model with an accuracy and precision of 1, indicating a perfect fit on this split with no misclassifications.
from sklearn import svm
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score , GridSearchCV
c_range= [0.001, 0.01, 0.1, 1, 10, 100]
param_grid = dict(C=c_range)
print("Parameter grid:\n{}".format(param_grid))
svc = SVC(kernel='linear')
grid_search = GridSearchCV(estimator=svc, param_grid = dict(C=c_range) ,n_jobs=-1)
grid_search.fit(X, y)
grid_search.best_score_
grid_search.best_params_
svc_best = SVC(kernel='linear', C=1.0, probability=True)
svc_best.fit(X_trainval, y_trainval)
svc_tr_pred = svc_best.predict(X_trainval)
svc_test_pred = svc_best.predict(X_test)
print("Train data")
print("Accuracy score: ", accuracy_score(y_trainval, svc_tr_pred))
print("f1 score: ", f1_score(y_trainval, svc_tr_pred))
print("recall score: ", recall_score(y_trainval, svc_tr_pred))
print("precision: ", precision_score(y_trainval, svc_tr_pred))
print(" ")
print("Test data")
print("Accuracy score: ", accuracy_score(y_test, svc_test_pred))
print("f1 score: ", f1_score(y_test, svc_test_pred))
print("recall score: ", recall_score(y_test, svc_test_pred))
print("precision: ", precision_score(y_test, svc_test_pred))
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, cohen_kappa_score
Accuracy=accuracy_score(y_test, clf_test_pred)
f1score = f1_score(y_test, clf_test_pred)
recallscore= recall_score(y_test, clf_test_pred)
precision = precision_score(y_test, clf_test_pred)
# adding vals to excel file that you created.
line = 'Linear SVC' + ',' +str("C=10")+','+str(Accuracy )+','+str(f1score)+ ','+str(recallscore)+','+str(precision) +'\n'
f = open('classificationOutput_GROUP19.csv','a')
f.write(line)
f.close()
pd.crosstab(y_trainval, svc_tr_pred)
pd.crosstab(y_test, svc_test_pred)
report = classification_report(y_test, svc_test_pred)
print(report)
### ROC curve
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from matplotlib import pyplot
# predict probabilities
probs = svc_best.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_test, probs)
print('AUC: %.3f' % auc)
# calculate roc curve
fpr, tpr, thresholds = roc_curve(y_test, probs)
print( thresholds )
# plot no skill
pyplot.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
pyplot.plot(fpr, tpr, marker='.')
# show the plot
pyplot.show()
The model below is the SVC with kernel='rbf'. GridSearchCV is used to derive the best parameters, i.e. C and gamma. The best parameters found for the RBF-kernel SVC are C=10 and gamma=0.5, which leave the model with an accuracy and precision of 1, indicating a perfect fit on this split with no misclassifications.
#from mlxtend.plotting import plot_decision_regions
from sklearn import svm
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score , GridSearchCV
c_range= [0.001, 0.01, 0.1, 1, 10, 100]
gamma_range=[0.001, 0.05,0.07,0.03,0.01,0.5,0.3, 0.1, 1, 10, 100]
param_grid = dict(C=c_range, gamma=gamma_range)
print("Parameter grid:\n{}".format(param_grid))
svc = SVC(kernel='rbf')
grid_search = GridSearchCV(estimator=svc, param_grid = dict(C=c_range,gamma=gamma_range) ,n_jobs=-1)
grid_search.fit(X, y)
grid_search.best_score_
grid_search.best_params_
svc_best_rbf = SVC(kernel='rbf', C=10, gamma=0.5)  # the best parameters found above
svc_best_rbf.fit(X_trainval, y_trainval)
svc_rbf_tr_pred = svc_best_rbf.predict(X_trainval)
svc_rbf_test_pred = svc_best_rbf.predict(X_test)
print("Train data")
print("Accuracy score: ", accuracy_score(y_trainval, svc_rbf_tr_pred))
print("f1 score: ", f1_score(y_trainval, svc_rbf_tr_pred))
print("recall score: ", recall_score(y_trainval, svc_rbf_tr_pred))
print("precision: ", precision_score(y_trainval, svc_rbf_tr_pred))
print(" ")
print("Test data")
print("Accuracy score: ", accuracy_score(y_test, svc_rbf_test_pred))
print("f1 score: ", f1_score(y_test, svc_rbf_test_pred))
print("recall score: ", recall_score(y_test, svc_rbf_test_pred))
print("precision: ", precision_score(y_test, svc_rbf_test_pred))
pd.crosstab(y_trainval, svc_rbf_tr_pred)
pd.crosstab(y_test, svc_rbf_test_pred)
#Report
report = classification_report(y_test, svc_rbf_test_pred)
print(report)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, cohen_kappa_score
Accuracy=accuracy_score(y_test, svc_rbf_test_pred)
f1score = f1_score(y_test, svc_rbf_test_pred)
recallscore= recall_score(y_test, svc_rbf_test_pred)
precision = precision_score(y_test, svc_rbf_test_pred)
# adding vals to excel file that you created.
line = 'SVC Kernel as rbf' + ',' +str("C=10 and gamma=0.5")+','+str(Accuracy )+','+str(f1score)+ ','+str(recallscore)+','+str(precision) +'\n'
f = open('classificationOutput_GROUP19.csv','a')
f.write(line)
f.close()
The model below is the SVC with kernel='poly'. GridSearchCV is used to derive the best parameters, i.e. C and degree. The best parameters found for the polynomial-kernel SVC are C=10 and degree=1, which leave the model with an accuracy and precision of 1, indicating a perfect fit on this split with no misclassifications.
from sklearn import svm
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score , GridSearchCV
c_range= [0.001, 0.01, 0.1, 1, 10, 100]
degree_range=[1,2,3,4]
param_grid = dict(C=c_range, degree = degree_range)
print("Parameter grid:\n{}".format(param_grid))
svc = SVC(kernel='poly')
grid_search = GridSearchCV(estimator=svc, param_grid = dict(C=c_range,degree = degree_range) ,n_jobs=-1)
grid_search.fit(X, y)
grid_search.best_score_
grid_search.best_params_
svc_best_poly = SVC(kernel='poly', C=10, degree=1)  # the best parameters found above
svc_best_poly.fit(X_trainval, y_trainval)
svc_poly_tr_pred = svc_best_poly.predict(X_trainval)
svc_poly_test_pred = svc_best_poly.predict(X_test)
print("Train data")
print("Accuracy score: ", accuracy_score(y_trainval, svc_poly_tr_pred))
print("f1 score: ", f1_score(y_trainval, svc_poly_tr_pred))
print("recall score: ", recall_score(y_trainval, svc_poly_tr_pred))
print("precision: ", precision_score(y_trainval, svc_poly_tr_pred))
print(" ")
print("Test data")
print("Accuracy score: ", accuracy_score(y_test, svc_poly_test_pred))
print("f1 score: ", f1_score(y_test, svc_poly_test_pred))
print("recall score: ", recall_score(y_test, svc_poly_test_pred))
print("precision: ", precision_score(y_test, svc_poly_test_pred))
pd.crosstab(y_trainval, svc_poly_tr_pred)
pd.crosstab(y_test, svc_poly_test_pred)
#Report
report = classification_report(y_test, svc_poly_test_pred)
print(report)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, cohen_kappa_score
Accuracy=accuracy_score(y_test, svc_poly_test_pred)
f1score = f1_score(y_test, svc_poly_test_pred)
recallscore= recall_score(y_test, svc_poly_test_pred)
precision = precision_score(y_test, svc_poly_test_pred)
# adding vals to excel file that you created.
line = 'SVC Kernel as Poly' + ',' +str("C=10 and degree=1")+','+str(Accuracy )+','+str(f1score)+ ','+str(recallscore)+','+str(precision) +'\n'
f = open('classificationOutput_GROUP19.csv','a')
f.write(line)
f.close()
The model below is the decision tree. GridSearchCV is used to derive the best parameter, i.e. max_depth. The best parameter found for the decision tree is max_depth=4, which leaves the model with an accuracy and precision of 1, indicating a perfect fit on this split with no misclassifications.
from sklearn.tree import DecisionTreeClassifier
DT = DecisionTreeClassifier()
param_grid = dict(max_depth=[4,6,8,10])
gs_dt = GridSearchCV(DT, param_grid=param_grid, cv=10, scoring='accuracy')
gs_dt.fit(X, y)
gs_dt.best_score_
gs_dt.best_params_
dt_best = DecisionTreeClassifier(max_depth=4)
dt_best.fit(X_trainval, y_trainval)
dt_tr_pred = dt_best.predict(X_trainval)
dt_test_pred = dt_best.predict(X_test)
print("Train data")
print("Accuracy score: ", accuracy_score(y_trainval, dt_tr_pred))
print("f1 score: ", f1_score(y_trainval, dt_tr_pred))
print("recall score: ", recall_score(y_trainval, dt_tr_pred))
print("precision: ", precision_score(y_trainval, dt_tr_pred))
print(" ")
print("Test data")
print("Accuracy score: ", accuracy_score(y_test, dt_test_pred))
print("f1 score: ", f1_score(y_test, dt_test_pred))
print("recall score: ", recall_score(y_test, dt_test_pred))
print("precision: ", precision_score(y_test, dt_test_pred))
fea_imp = dt_best.feature_importances_
columns = X_trainval.columns
feat_cols = pd.DataFrame({'name_col':columns,'feat_imp':fea_imp})
features=columns
importances = dt_best.feature_importances_
indices = np.argsort(importances)
plt.figure(1)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
pd.crosstab(y_trainval, dt_tr_pred)
pd.crosstab(y_test, dt_test_pred)
report = classification_report(y_test, dt_test_pred)
print(report)
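For a shallow tree like this one, the learned rules can also be printed directly with sklearn's export_text, which would show which risk factors drive each split. A sketch on the built-in iris data (not the audit features):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# export_text renders the fitted tree as human-readable if/else rules
print(export_text(tree, feature_names=list(iris.feature_names)))
```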
### ROC curve
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from matplotlib import pyplot
# predict probabilities
probs = dt_best.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_test, probs)
print('AUC: %.3f' % auc)
# calculate roc curve
fpr, tpr, thresholds = roc_curve(y_test, probs)
print( thresholds )
# plot no skill
pyplot.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
pyplot.plot(fpr, tpr, marker='.')
# show the plot
pyplot.show()
#### Precision and recall curve
X_train.columns
from sklearn.metrics import auc
from sklearn.metrics import average_precision_score
from sklearn.metrics import precision_recall_curve
# predict probabilities
probs = dt_best.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# predict class values
y_prd_class_val = dt_best.predict(X_test)
# calculate precision-recall curve
precision, recall, thresholds = precision_recall_curve(y_test, probs)
# calculate F1 score
f1 = f1_score(y_test, y_prd_class_val)
# calculate precision-recall AUC (pr_auc avoids shadowing sklearn's auc function)
pr_auc = auc(recall, precision)
# calculate average precision score
ap = average_precision_score(y_test, probs)
print('f1=%.3f auc=%.3f ap=%.3f' % (f1, pr_auc, ap))
# plot no skill
pyplot.plot([0, 1], [0.5, 0.5], linestyle='--')
# plot the precision-recall curve for the model
pyplot.plot(recall, precision, marker='.')
# show the plot
pyplot.show()
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, cohen_kappa_score
Accuracy=accuracy_score(y_test, dt_test_pred)
f1score = f1_score(y_test, dt_test_pred)
recallscore= recall_score(y_test, dt_test_pred)
precision = precision_score(y_test, dt_test_pred)
# adding vals to excel file that you created.
line = 'Decision Tree' + ',' +str("max_depth=4")+','+str(Accuracy )+','+str(f1score)+ ','+str(recallscore)+','+str(precision) +'\n'
f = open('classificationOutput_GROUP19.csv','a')
f.write(line)
f.close()
Random forest is used here for feature selection; this section is for reference only.
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
estimator = [20,50,70]
max_features_val= [10,15,20]
param_grid = dict(n_estimators=estimator, max_features=max_features_val)
print(param_grid)
rf_r = RandomForestClassifier()
rfgs = GridSearchCV(rf_r, param_grid=param_grid, cv=10, scoring='accuracy')  # accuracy, not r2, for a classifier
rfgs.fit(X,y)
rfgs.best_score_
rfgs.best_params_
rf_r_best = RandomForestClassifier(n_estimators=20, max_features=20)  # classifier, since y here is the RISK label
rf_r_best.fit(X_trainval,y_trainval)
features = rf_r_best.feature_importances_
features
cols = X_trainval.columns
cols
feat_cols = pd.DataFrame(data=[cols,features])
feat_cols
features=cols
importances = rf_r_best.feature_importances_
indices = np.argsort(importances)
plt.figure(1)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# log the random forest's own test predictions (not the decision tree's)
rf_clf = RandomForestClassifier(n_estimators=20, max_features=20)
rf_clf.fit(X_trainval, y_trainval)
rf_test_pred = rf_clf.predict(X_test)
Accuracy = accuracy_score(y_test, rf_test_pred)
f1score = f1_score(y_test, rf_test_pred)
recallscore = recall_score(y_test, rf_test_pred)
precision = precision_score(y_test, rf_test_pred)
# adding vals to the csv file that you created.
line = 'Random Forest'+',' +str("n=20")+','+str(Accuracy )+','+str(f1score)+ ','+str(recallscore)+','+str(precision)+','+str('max_features= 20')+'\n'
f = open('classificationOutput_GROUP19.csv','a')
f.write(line)
f.close()
Of all the models fitted, the polynomial-kernel model performs best.
All of the classification models produce essentially error-free predictions on the test split. The random forest classifier is an efficient tool for feature selection. The CSV files record the outputs for both the regression and classification tasks.